
Supplementary Document to the Paper "Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee"

Neural Information Processing Systems

As a technical tool for the proof, we first restate Lemma 6.1 of Chérief-Abdellatif and Alquier. Under Conditions 4.1 and 4.2, we have a lemma that shows the existence of testing functions, which concludes the proof. Following Pati et al. (2018), it can be shown that the third term on the right-hand side of (9) is bounded, and the fifth term is bounded by O(1/n). The convergence under the squared Hellinger distance is a direct result of Lemmas 4.1 and 4.2. As mentioned by Sønderby et al. (2016) and Molchanov et al. (2017), training sparse networks requires care; the optimization method used is Adam. The implementation details for the UCI datasets and MNIST can be found in Sections 2.5 and 2.6. In this section, we aim to demonstrate, via a toy example, that there is little difference between the results obtained with the inverse-CDF reparameterization and the Gumbel-softmax approximation.
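The toy comparison mentioned at the end rests on two standard sampling tricks: drawing a relaxed one-hot sample with the Gumbel-softmax trick versus drawing a hard sample through the inverse CDF. A minimal NumPy sketch (the probabilities and temperature here are illustrative, not the paper's experimental settings):

```python
import numpy as np

rng = np.random.default_rng(5)

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample via the Gumbel-softmax trick."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    z = (logits + g) / tau                                # temperature tau controls sharpness
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def inverse_cdf_bernoulli(p, u):
    """Hard Bernoulli(p) sample from a uniform u via the inverse CDF."""
    return float(u < p)

logits = np.log(np.array([0.7, 0.3]))       # two-category toy distribution
sample = gumbel_softmax(logits, tau=0.5, rng=rng)  # soft, differentiable sample
hard = inverse_cdf_bernoulli(0.7, rng.uniform())   # hard, non-differentiable sample
```

As the temperature tau shrinks, the relaxed sample concentrates on one category and approaches the hard inverse-CDF draw.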


How a Student becomes a Teacher: learning and forgetting through Spectral methods

Neural Information Processing Systems

In theoretical Machine Learning, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. A student network is trained on data generated by a fixed teacher network until it matches the instructor's ability to cope with the assigned task. The above scheme proves particularly relevant when the student network is overparameterized (namely, when larger layer sizes are employed) as compared to the underlying teacher network. Under these operating conditions, it is tempting to speculate that the student ability to handle the given task could be eventually stored in a sub-portion of the whole network. This latter should be to some extent reminiscent of the frozen teacher structure, according to suitable metrics, while being approximately invariant across different architectures of the student candidate network.
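The teacher-student setup described above can be sketched in a few lines: a fixed, randomly initialized teacher labels the data, and an overparameterized student (larger hidden layer) is trained by gradient descent to match it. A minimal NumPy sketch; the layer sizes, learning rate, and step count are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed "teacher": a small one-hidden-layer tanh network that labels the data.
d, k_teacher, k_student, n = 5, 3, 30, 200
Wt = rng.standard_normal((k_teacher, d))
at = rng.standard_normal(k_teacher)

def teacher(X):
    return np.tanh(X @ Wt.T) @ at

# Overparameterized "student" (k_student >> k_teacher), trained on teacher labels.
Ws = 0.1 * rng.standard_normal((k_student, d))
a_s = 0.1 * rng.standard_normal(k_student)

X = rng.standard_normal((n, d))
y = teacher(X)

lr = 0.05
for _ in range(2000):
    H = np.tanh(X @ Ws.T)                 # hidden activations, shape (n, k_student)
    err = H @ a_s - y                     # residual, shape (n,)
    grad_a = H.T @ err / n                # gradient of the MSE w.r.t. output weights
    grad_W = ((err[:, None] * (1 - H**2) * a_s).T @ X) / n  # and w.r.t. hidden weights
    a_s -= lr * grad_a
    Ws -= lr * grad_W

mse = np.mean((np.tanh(X @ Ws.T) @ a_s - y) ** 2)
```

After training, one can probe which sub-portion of the student's 30 hidden units actually carries the function, in the spirit of the speculation above.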


Paraphrasing Complex Network: Network Compression via Factor Transfer

Neural Information Processing Systems

Many researchers have sought model compression methods that reduce the size of a deep neural network (DNN) with minimal performance degradation, in order to use DNNs in embedded systems. Among model compression methods, knowledge transfer trains a student network with the guidance of a stronger teacher network. In this paper, we propose a novel knowledge transfer method which uses convolutional operations to paraphrase the teacher's knowledge and to translate it for the student. This is done by two convolutional modules, called a paraphraser and a translator. The paraphraser is trained in an unsupervised manner to extract the teacher factors, which are defined as paraphrased information of the teacher network. The translator, located at the student network, extracts the student factors and helps translate the teacher factors by mimicking them. We observed that a student network trained with the proposed factor transfer method outperforms ones trained with conventional knowledge transfer methods.
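The factor-transfer idea, a paraphraser extracting teacher factors and a translator producing student factors that mimic them, can be illustrated schematically. A toy NumPy sketch: the 1x1-convolution stand-ins, shapes, and the l2-normalized L2 loss are illustrative simplifications of the paper's modules, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_factor(feature_map, conv_weights):
    """Stand-in for the paraphraser/translator: a 1x1 convolution mapping
    C input channels to C_factor factor channels (hypothetical shapes)."""
    C, H, W = feature_map.shape
    return (conv_weights @ feature_map.reshape(C, -1)).reshape(-1, H, W)

def factor_transfer_loss(teacher_factor, student_factor):
    """Distance between l2-normalized factors, which the student minimizes."""
    t = teacher_factor.ravel()
    s = student_factor.ravel()
    t = t / np.linalg.norm(t)
    s = s / np.linalg.norm(s)
    return np.sum((t - s) ** 2)

teacher_feat = rng.standard_normal((16, 4, 4))  # teacher feature map (C=16)
student_feat = rng.standard_normal((8, 4, 4))   # student feature map (C=8)
paraphraser = rng.standard_normal((4, 16))      # trained unsupervised in the paper
translator = rng.standard_normal((4, 8))        # trained jointly with the student

loss = factor_transfer_loss(extract_factor(teacher_feat, paraphraser),
                            extract_factor(student_feat, translator))
```

Both modules compress their feature maps into a shared 4-channel factor space, so the loss compares representations of the same dimensionality even though the networks differ in width.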


Bi-directional Weakly Supervised Knowledge Distillation for Whole Slide Image Classification

Neural Information Processing Systems

Computer-aided pathology diagnosis based on the classification of Whole Slide Image (WSI) plays an important role in clinical practice, and it is often formulated as a weakly-supervised Multiple Instance Learning (MIL) problem. Existing methods solve this problem from either a bag classification or an instance classification perspective. In this paper, we propose an end-to-end weakly supervised knowledge distillation framework (WENO) for WSI classification, which integrates a bag classifier and an instance classifier in a knowledge distillation framework to mutually improve the performance of both classifiers. Specifically, an attention-based bag classifier is used as the teacher network, which is trained with weak bag labels, and an instance classifier is used as the student network, which is trained using the normalized attention scores obtained from the teacher network as soft pseudo labels for the instances in positive bags. An instance feature extractor is shared between the teacher and the student to further enhance the knowledge exchange between them. In addition, we propose a hard positive instance mining strategy based on the output of the student network to force the teacher network to keep mining hard positive instances. WENO is a plug-and-play framework that can be easily applied to any existing attention-based bag classification methods. Extensive experiments on five datasets demonstrate the efficiency of WENO. Code is available at https://github.com/miccaiif/WENO.
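The distillation loop described above, in which the teacher's attention scores over instances in a positive bag are normalized into soft pseudo labels for the student instance classifier, can be sketched as follows (a NumPy sketch with hypothetical weights and shapes, not the WENO implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Instance features of one positive bag, from the shared feature extractor.
instances = rng.standard_normal((6, 8))     # (n_instances, feat_dim)

# Teacher: attention-based bag classifier (attention weights are hypothetical).
w_att = rng.standard_normal(8)
att = softmax(instances @ w_att)            # attention over instances

# Normalized attention scores in [0, 1] act as soft pseudo labels.
pseudo = (att - att.min()) / (att.max() - att.min())

# Student: instance classifier trained against the soft pseudo labels.
w_student = rng.standard_normal(8)
probs = 1 / (1 + np.exp(-(instances @ w_student)))
bce = -np.mean(pseudo * np.log(probs + 1e-12)
               + (1 - pseudo) * np.log(1 - probs + 1e-12))
```

In the full framework this runs bi-directionally: the student's outputs feed back into hard positive instance mining that shapes what the teacher attends to.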


Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Neural Information Processing Systems

Most existing works in few-shot learning rely on meta-learning the network on a large base dataset, typically from the same domain as the target dataset. We tackle the problem of cross-domain few-shot learning, where there is a large shift between the base and target domains. The problem of cross-domain few-shot recognition with unlabeled target data is largely unaddressed in the literature. STARTUP was the first method to tackle this problem using self-training. However, it uses a fixed teacher, pretrained on a labeled base dataset, to create soft labels for the unlabeled target samples.
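The fixed-teacher self-training that the abstract contrasts against can be sketched as: frozen teacher logits on unlabeled target samples are softened with a temperature and used as soft labels in the student's cross-entropy. A minimal NumPy sketch; the temperature and shapes are illustrative, not taken from either paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Frozen teacher logits for unlabeled target samples (hypothetical values).
teacher_logits = rng.standard_normal((4, 5))   # (n_samples, n_classes)
T = 4.0                                        # softening temperature
soft_labels = softmax(teacher_logits / T)      # fixed for the whole training run

# Student distillation loss: cross-entropy against the fixed soft labels.
student_logits = rng.standard_normal((4, 5))
student_probs = softmax(student_logits)
loss = -np.mean(np.sum(soft_labels * np.log(student_probs + 1e-12), axis=1))
```

A dynamic distillation scheme would instead update the source of `soft_labels` during training rather than keeping the teacher frozen, which is the limitation this paper targets.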


A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point

Couto, Carlos, Mourão, José, Figueiredo, Mário A. T., Ribeiro, Pedro

arXiv.org Machine Learning

Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
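The linear-network claim can be probed numerically in the simplest setting: for a one-layer linear model with squared loss on Gaussian inputs, the Hessian is 2XᵀX/n, whose eigenvalues follow a scaled Marchenko-Pastur law as n and d grow with fixed ratio. A NumPy sketch under that simplifying assumption (the paper's two-layer result additionally involves a chi-square factor):

```python
import numpy as np

rng = np.random.default_rng(4)

# One-layer linear model y = w.x with MSE loss: the Hessian is 2 X^T X / n,
# so for Gaussian X its spectrum approaches a scaled Marchenko-Pastur law.
n, d = 2000, 400
X = rng.standard_normal((n, d))
H = 2.0 * X.T @ X / n
eigs = np.linalg.eigvalsh(H)

# Marchenko-Pastur support edges for ratio q = d/n (unit variance, scaled by 2).
q = d / n
lo, hi = 2 * (1 - np.sqrt(q)) ** 2, 2 * (1 + np.sqrt(q)) ** 2
frac_inside = np.mean((eigs >= lo - 0.2) & (eigs <= hi + 0.2))
```

With n > d the smallest eigenvalue stays bounded away from zero, which is the regime where the small end of the spectrum governs long-time learning performance as described above.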